Understanding and visualizing the data
🤓 So we have this table:
| | Asset id | runtime | setting_1 | setting_2 | setting_3 | Tag1 | Tag2 | Tag3 | Tag4 | Tag5 | ... | Tag12 | Tag13 | Tag14 | Tag15 | Tag16 | Tag17 | Tag18 | Tag19 | Tag20 | Tag21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | -0.0007 | -0.0004 | 100.0 | 518.67 | 641.82 | 1589.70 | 1400.60 | 14.62 | ... | 521.66 | 2388.02 | 8138.62 | 8.4195 | 0.03 | 392 | 2388 | 100.0 | 39.06 | 23.4190 |
| 1 | 1 | 2 | 0.0019 | -0.0003 | 100.0 | 518.67 | 642.15 | 1591.82 | 1403.14 | 14.62 | ... | 522.28 | 2388.07 | 8131.49 | 8.4318 | 0.03 | 392 | 2388 | 100.0 | 39.00 | 23.4236 |
| 2 | 1 | 3 | -0.0043 | 0.0003 | 100.0 | 518.67 | 642.35 | 1587.99 | 1404.20 | 14.62 | ... | 522.42 | 2388.03 | 8133.23 | 8.4178 | 0.03 | 390 | 2388 | 100.0 | 38.95 | 23.3442 |
| 3 | 1 | 4 | 0.0007 | 0.0000 | 100.0 | 518.67 | 642.35 | 1582.79 | 1401.87 | 14.62 | ... | 522.86 | 2388.08 | 8133.83 | 8.3682 | 0.03 | 392 | 2388 | 100.0 | 38.88 | 23.3739 |
| 4 | 1 | 5 | -0.0019 | -0.0002 | 100.0 | 518.67 | 642.37 | 1582.85 | 1406.22 | 14.62 | ... | 522.19 | 2388.04 | 8133.80 | 8.4294 | 0.03 | 393 | 2388 | 100.0 | 38.90 | 23.4044 |
5 rows × 26 columns
We have 20631 observations, each one corresponding to a single execution cycle of an asset.
Let's look at how many cycles each asset ran before failing and check some general statistics about it.
| | runtime |
|---|---|
| count | 100.000000 |
| mean | 206.310000 |
| std | 46.342749 |
| min | 128.000000 |
| 25% | 177.000000 |
| 50% | 199.000000 |
| 75% | 229.250000 |
| max | 362.000000 |
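The table above summarizes the lifetime of each asset. A minimal sketch of how it can be computed, assuming the data lives in a pandas DataFrame with `asset_id` and `runtime` columns (names assumed from the table above; the tiny DataFrame here is only a stand-in for the real data):

```python
import pandas as pd

# Toy stand-in for the real dataset: each asset logs one row per cycle
df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2],
    "runtime":  [1, 2, 3, 1, 2],
})

# Lifetime of each asset = its last recorded cycle before failure
lifetimes = df.groupby("asset_id")["runtime"].max()
print(lifetimes.describe())
```

On the real data, `count` would be 100 (one lifetime per asset), matching the statistics above.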
The statistics suggest that assets typically last around 200 cycles (median 199, mean about 206). We can also see that the asset that failed earliest ran for only 128 cycles, while the longest-lived one reached 362.
The second layer of the dataset is the settings used to run each asset. Let's see:
| | setting_1 | setting_2 | setting_3 |
|---|---|---|---|
| count | 20631.000000 | 20631.000000 | 20631.0 |
| mean | -0.000009 | 0.000002 | 100.0 |
| std | 0.002187 | 0.000293 | 0.0 |
| min | -0.008700 | -0.000600 | 100.0 |
| 25% | -0.001500 | -0.000200 | 100.0 |
| 50% | 0.000000 | 0.000000 | 100.0 |
| 75% | 0.001500 | 0.000300 | 100.0 |
| max | 0.008700 | 0.000600 | 100.0 |
As the standard deviations show, the settings barely change: setting_3 is constant, and settings 1 and 2 vary only in the third or fourth decimal place. This suggests (and only suggests, as we don't have more detail on what exactly these settings are) that the assets are used within a stable production line with little variation.
Last but not least, we have the 21 tags representing the readings of each monitoring sensor on the asset. Let's take a look:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Tag1 | 20631.0 | 518.670000 | 0.000000e+00 | 518.6700 | 518.6700 | 518.6700 | 518.6700 | 518.6700 |
| Tag2 | 20631.0 | 642.680934 | 5.000533e-01 | 641.2100 | 642.3250 | 642.6400 | 643.0000 | 644.5300 |
| Tag3 | 20631.0 | 1590.523119 | 6.131150e+00 | 1571.0400 | 1586.2600 | 1590.1000 | 1594.3800 | 1616.9100 |
| Tag4 | 20631.0 | 1408.933782 | 9.000605e+00 | 1382.2500 | 1402.3600 | 1408.0400 | 1414.5550 | 1441.4900 |
| Tag5 | 20631.0 | 14.620000 | 1.776400e-15 | 14.6200 | 14.6200 | 14.6200 | 14.6200 | 14.6200 |
| Tag6 | 20631.0 | 21.609803 | 1.388985e-03 | 21.6000 | 21.6100 | 21.6100 | 21.6100 | 21.6100 |
| Tag7 | 20631.0 | 553.367711 | 8.850923e-01 | 549.8500 | 552.8100 | 553.4400 | 554.0100 | 556.0600 |
| Tag8 | 20631.0 | 2388.096652 | 7.098548e-02 | 2387.9000 | 2388.0500 | 2388.0900 | 2388.1400 | 2388.5600 |
| Tag9 | 20631.0 | 9065.242941 | 2.208288e+01 | 9021.7300 | 9053.1000 | 9060.6600 | 9069.4200 | 9244.5900 |
| Tag10 | 20631.0 | 1.300000 | 0.000000e+00 | 1.3000 | 1.3000 | 1.3000 | 1.3000 | 1.3000 |
| Tag11 | 20631.0 | 47.541168 | 2.670874e-01 | 46.8500 | 47.3500 | 47.5100 | 47.7000 | 48.5300 |
| Tag12 | 20631.0 | 521.413470 | 7.375534e-01 | 518.6900 | 520.9600 | 521.4800 | 521.9500 | 523.3800 |
| Tag13 | 20631.0 | 2388.096152 | 7.191892e-02 | 2387.8800 | 2388.0400 | 2388.0900 | 2388.1400 | 2388.5600 |
| Tag14 | 20631.0 | 8143.752722 | 1.907618e+01 | 8099.9400 | 8133.2450 | 8140.5400 | 8148.3100 | 8293.7200 |
| Tag15 | 20631.0 | 8.442146 | 3.750504e-02 | 8.3249 | 8.4149 | 8.4389 | 8.4656 | 8.5848 |
| Tag16 | 20631.0 | 0.030000 | 1.387812e-17 | 0.0300 | 0.0300 | 0.0300 | 0.0300 | 0.0300 |
| Tag17 | 20631.0 | 393.210654 | 1.548763e+00 | 388.0000 | 392.0000 | 393.0000 | 394.0000 | 400.0000 |
| Tag18 | 20631.0 | 2388.000000 | 0.000000e+00 | 2388.0000 | 2388.0000 | 2388.0000 | 2388.0000 | 2388.0000 |
| Tag19 | 20631.0 | 100.000000 | 0.000000e+00 | 100.0000 | 100.0000 | 100.0000 | 100.0000 | 100.0000 |
| Tag20 | 20631.0 | 38.816271 | 1.807464e-01 | 38.1400 | 38.7000 | 38.8300 | 38.9500 | 39.4300 |
| Tag21 | 20631.0 | 23.289705 | 1.082509e-01 | 22.8942 | 23.2218 | 23.2979 | 23.3668 | 23.6184 |
It's a little messy to analyze all those numbers, so let's make some visualizations.
👀 So, what can we intuit from the combination of the statistical summary and the visualization of the sensors for each asset?
Some sensors keep one and the same value (Tag1, Tag10, Tag18 and Tag19) throughout the data, as we can see from the recurrence of a single reading with zero standard deviation (Tag5 and Tag16 are effectively constant too, with standard deviations that are numerically zero). We can drop these columns to preserve only useful information and prevent further model overfitting.

To advance the analysis toward prediction, we can use the concept of remaining useful life (RUL). It works as a countdown indicating how many cycles each asset has left before failure. Calculating it is simple: as we already know the total useful life of each asset, we just have to invert the cycle count.
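The inversion can be sketched in pandas, assuming the same `asset_id` and `runtime` column names as before (a minimal stand-in DataFrame, not the real data):

```python
import pandas as pd

df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2],
    "runtime":  [1, 2, 3, 1, 2],
})

# RUL = cycles remaining until the asset's last observed cycle (its failure)
df["RUL"] = df.groupby("asset_id")["runtime"].transform("max") - df["runtime"]
print(df)
```

Each asset counts down to 0 at its final recorded cycle, which is exactly the "inverted count" described above.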
And we have the following distribution:
Or, we can also work with the RUL to better visualize the correlation between the sensor data and the failure.
We can also observe which sensors do not add relevant information to the construction of a linear model.
✅ The asset lifetimes follow an approximately normal distribution, indicating that we can make statistical inferences about the lifetime of the equipment.
✅ Settings have little or no impact on building a predictive model.
✅ We can use RUL calculation to correlate variables and predict future failure.
✅ Some sensors present irrelevant data for the construction of a linear model, which we will discard at first.
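Discarding the uninformative sensors can be sketched as follows; the column names mirror the table above, and the tiny DataFrame is only an illustrative stand-in:

```python
import pandas as pd

df = pd.DataFrame({
    "Tag1": [518.67, 518.67, 518.67],   # constant -> carries no information
    "Tag2": [641.82, 642.15, 642.35],   # varies -> potentially useful
})

# Drop columns whose standard deviation is numerically zero
constant_cols = df.columns[df.std() < 1e-10]
df = df.drop(columns=constant_cols)
print(list(df.columns))
```

The small tolerance catches sensors like Tag5 and Tag16, whose standard deviations are on the order of 1e-15 rather than exactly zero.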
There are two possible approaches within predictive models: regression and classification.
Classification modeling is intended to signal whether the predicted value falls within a specific label. Let's start with it!
In our case, the first need is to predict whether the asset is within the final 20 cycles before failure, a kind of health indicator for the equipment.
Here we label the measurements within the final 20 cycles before failure with 0 (unhealthy), and everything before that threshold with 1 (healthy).
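A sketch of this labeling plus the logistic regression step, on synthetic data (the real features would be the sensor tags; every value and the drifting-sensor shape here are made-up assumptions to keep the example self-contained):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(42)

# Synthetic stand-in: one sensor that drifts upward as RUL approaches zero
rul = rng.integers(0, 200, size=1000)
sensor = 640 + 3 * np.exp(-rul / 30) + rng.normal(0, 0.3, size=1000)

# Label: 0 inside the final 20 cycles before failure, 1 otherwise
label = (rul > 20).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    sensor.reshape(-1, 1), label, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

The same pattern (binary label from an RUL threshold, then a linear classifier) is what produces the report below.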
🤖 Doing some machine learning magic 🤖
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.84 | 0.83 | 0.83 | 375 |
| 1 | 0.98 | 0.98 | 0.98 | 3752 |
| accuracy | | | 0.97 | 4127 |
| macro avg | 0.91 | 0.91 | 0.91 | 4127 |
| weighted avg | 0.97 | 0.97 | 0.97 | 4127 |
Excellent job! With a simple logistic regression we reach 97% overall accuracy, flagging the final-20-cycle window (class 0) with 84% precision and 83% recall! 😯
What if we wanted not only to know if the asset is healthy or unhealthy, but to know more precisely how many cycles it is at the end of its useful life?
This is where regression comes in: instead of predicting labels, it predicts a continuous number of cycles.
Let's run it! 🏃
RMSE:44.33391060979522, R2:0.5698000071522076
R² can be understood, roughly, as how much of the variation in our target variable the model explains. And the RMSE is the typical error, measured in runtime cycles of the RUL.
So, a simple linear regression model can explain around 57% of the variance, with a typical error of about 44 cycles. 🥶
It doesn't seem very accurate, let's see if we can get better results.
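For reference, the baseline itself can be sketched on synthetic data (the real features would be the non-constant tags; the single drifting sensor and all values here are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic sensor/RUL pair: the sensor degrades non-linearly near failure
rul = rng.integers(0, 200, size=1000).astype(float)
X = (640 + 3 * np.exp(-rul / 30) + rng.normal(0, 0.3, 1000)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, rul, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)
rmse = mean_squared_error(y_te, pred) ** 0.5
print(f"RMSE: {rmse:.2f}, R2: {r2_score(y_te, pred):.2f}")
```

A linear fit struggles here for the same reason discussed next: the relation between sensor readings and RUL is not linear over the whole lifetime.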
Here we can apply a more accurate understanding of the RUL gained from the exploratory analysis, where we observed the degradation in the sensors.
📉 RUL is not linear all the time! Degradation only sets in at some point in an asset's life.
As we don't have more information about what the sensors actually measure, I will assume that degradation only becomes observable from the 120th cycle onward, and start the regression target there.
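Under that assumption, "starting the regression at cycle 120" amounts to capping the RUL target, a common trick for this kind of run-to-failure data. A minimal sketch (the cap of 120 comes from the text above; the sample values are made up):

```python
import numpy as np

# Piecewise RUL: health is assumed flat until a knee point, then declines,
# so the target is capped at 120 cycles instead of growing without bound
rul = np.array([200, 150, 120, 80, 10, 0])
clipped = np.minimum(rul, 120)
print(clipped)
```

Early-life cycles all share the capped value, so the model only has to explain the part of the curve where degradation is actually visible in the sensors.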
Annnd ta dããã!
RMSE:18.763818622288873, R2:0.7729827657293247
The model shows a considerable improvement, now explaining 77% of the variance of the RUL with a typical error of around 19 cycles! 🥳
XGBoost is a much more refined and complex model, with great performance on this kind of tabular problem. Let's try it.
RMSE:18.33642899230519, R2:0.7832066791831965
We have an improvement in R² (accompanied by a reduction in RMSE). But the great advantage of working with XGBoost lies in its hyperparameters, so let's run a random search for the best values of these parameters:
Here we go!
RMSE:17.38761409374267, R2:0.8050620649150194
And again we see an improvement! 🤩
This is our final model: it explains 80% of the variance, with a typical error of about 17 cycles. 👌
From this model I will generate predictions for the test dataset and save them to a .csv file for proper comparison with the true values.
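Exporting the predictions can be sketched like this; the fitted model, the feature matrix, the column name and the file name are all placeholders, and any fitted regressor works the same way:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Placeholder model and test set standing in for the tuned XGBoost regressor
model = LinearRegression().fit([[0], [1], [2]], [0, 10, 20])
X_test = [[3], [4]]

# Predict RUL for the test set and persist it for later comparison
pred = model.predict(X_test)
pd.DataFrame({"predicted_RUL": pred}).to_csv("predictions.csv", index=False)
```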